DESIGN AND IMPLEMENTATION OF POWER BIG DATA PLATFORM
ABSTRACT
In recent years, with the advancement of national smart grid construction and the development of integrated power grid enterprise systems, the traditional power data platform has shown several defects: insufficient scalability, duplicated implementation of offline and real-time data-warehouse logic, tight coupling between data-warehouse storage and computation, and difficulty of platform migration. These defects restrict the deeper mining and analysis of power big data. This paper therefore adopts a technical architecture incorporating distributed computing, microservices, stream-batch integration, and lake-warehouse (lakehouse) integration, combining popular Web front-end and back-end technologies such as Vue, Spring Cloud, and Flask, and utilizing big data components such as Hadoop, Flink, Hudi, and Kafka together with Docker containerization. On this basis, we explore and build a big data platform oriented toward power grid system operation that unifies real-time computing with multi-source heterogeneous data storage, and we develop APIs for authority management, information management, data analysis and visualization, machine learning, and other data-supporting services that assist power enterprises in managing electrical equipment and users' power consumption. With the help of this platform, power enterprises can intelligently track equipment status and view offline and real-time visual charts to analyze users' power consumption, providing a sounder basis for power dispatching, maintenance, and the adjustment of electricity-market schemes.
CHAPTER ONE
Introduction
Power data is characterized by multiple sources, heterogeneity, large volume, rapid growth, and a low rate of utilization. The data can be roughly divided into three categories: power grid operation monitoring data, marketing data, and enterprise management data, which originate from smart meters, sensors, and other devices, as well as from the information platforms of the various parts of grid operation such as power generation, transmission, and energy consumption. The different types of power data vary greatly in structure: structured data, represented by tabular records of equipment properties, environmental parameters, and other information; unstructured data, represented by the video, audio, pictures, and documents generated by monitoring systems; and semi-structured data, represented by interface-type data in JSON or XML format from other business databases (a sketch of the JSON case is given below). In actual production practice, however, massive volumes of data often coexist with a scarcity of information that can be effectively mined. Employing up-to-date data processing tools and designing analysis applications based on big data technology can markedly improve the computational efficiency and comprehensive analytical capability applied to power grid data, provide more valuable information for subsequent real-time decision-making and management, and thus create new development opportunities for the construction of the smart grid.
In the early 21st century, Google published three technical papers introducing the scalable distributed file system GFS [1], the MapReduce programming model for parallel analysis of large-scale datasets [2], and the distributed storage system BigTable for massive data [3]. These three systems later evolved into Hadoop's core architecture, namely the distributed file system HDFS, the distributed computing framework MapReduce, and the distributed database HBase. Today, Hadoop has become an indispensable component of many enterprise big data architectures. Its distributed computing and storage capabilities offer high reliability, scalability, efficiency, fault tolerance, and low cost, making it widely applicable to resource management, job scheduling, data storage, data analysis, and other domains, and therefore able to meet the requirements of electric power data platforms.
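As an illustration of the semi-structured category mentioned above, a meter reading delivered through the interface of another business system might take the following JSON form. This is a hypothetical sketch: every field name here is an illustrative assumption, not a real interface schema.

    {
      "meterId": "m-01",
      "readingTime": "2024-01-01T00:00:00Z",
      "kwh": 1.2,
      "voltage": 220.4,
      "alarms": []
    }

Such records carry a nested, self-describing structure that fits neither a fixed relational schema nor a purely unstructured store, which is why they are treated as a category of their own.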
In practical production scenarios, the same metric may need to be produced both in real time by streaming tasks and offline by batch tasks. Reusing stream-processing results in batch processing can effectively reduce developers' workload. In real-time scenarios, however, out-of-order data can cause the results of stream processing to deviate from those of batch processing, compromising data quality. The first generation of distributed open-source stream processing engines, represented by Storm, sacrificed result accuracy for lower latency and could not guarantee "exactly-once" consistency. The second generation, built around the Lambda architecture, combined first-generation stream processors with traditional batch processors to achieve both low latency and high accuracy, but such systems were difficult to build and maintain. As a representative of the third generation, Flink, with the Kappa architecture at its core, not only inherits the merits of its predecessors but also offers high throughput and high availability, and it effectively enables stream-processing results to be reused in batch processing. Boshra Pishgoo et al. [4] proposed a Hybrid Distributed Batch-Stream (HDBS) architecture for real-time data anomaly detection, demonstrating that such an architecture can guarantee the accuracy of batch processing while retaining the speed of stream processing.
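To make Flink's handling of out-of-order data concrete, the following is a minimal Java sketch using the DataStream API (assuming Flink 1.14+ on the classpath; the stream elements, the 5-second disorder bound, and the window size are all illustrative assumptions). Event-time watermarks let a record that arrives late still fall into the correct window, while periodic checkpointing underpins the exactly-once state consistency discussed above.

    import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    import java.time.Duration;

    public class MeterAggregationJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Periodic checkpoints are the basis of Flink's exactly-once state guarantee.
            env.enableCheckpointing(60_000);

            // (meterId, kWh, eventTimeMillis); the last element arrives out of order.
            env.fromElements(
                    Tuple3.of("m-01", 1.2, 1_000L),
                    Tuple3.of("m-01", 0.8, 9_000L),
                    Tuple3.of("m-01", 0.5, 2_500L))
                // Watermarks tolerate up to 5 s of disorder before an event-time window closes.
                .assignTimestampsAndWatermarks(
                    WatermarkStrategy
                        .<Tuple3<String, Double, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner(
                            (SerializableTimestampAssigner<Tuple3<String, Double, Long>>)
                                (reading, previous) -> reading.f2))
                .keyBy(reading -> reading.f0)
                // Five-second tumbling windows computed on event time, not arrival time.
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                // Sum energy per meter; the late 2 500 ms record still lands in the first window.
                .reduce((a, b) -> Tuple3.of(a.f0, a.f1 + b.f1, Math.max(a.f2, b.f2)))
                .print();

            env.execute("per-meter energy aggregation");
        }
    }

Because the same DataStream program can also run over bounded, replayed input in batch mode and produce identical windows, a single code base can serve both the real-time and the offline path, which is precisely the duplication the Kappa architecture avoids.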
In recent years, the Lakehouse [5] has emerged as a new data management architecture that combines the structure and management capabilities of the data warehouse with the low-cost storage and flexibility of the data lake. With underlying storage in widely adopted open data formats and a metadata layer on top that provides transaction management, version control, and SQL operations, this architecture overcomes challenges of the traditional two-tier design, such as data inconsistency between lake and warehouse, stale data caused by ETL delays, weak support for complex analytics, and high cost. It further provides a consistent interface for higher-level services such as business intelligence, reporting and analysis, data science, and machine learning.
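A minimal sketch of this pattern with Hudi as the table format follows, assuming a Flink Table API setup with the Hudi Flink bundle on the classpath; the table name, schema, and HDFS path are illustrative assumptions.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class LakehouseTableSketch {
        public static void main(String[] args) throws Exception {
            TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // One copy of the data on HDFS, managed by Hudi: writes are keyed upserts
            // with transactional commits, readable by both streaming and batch jobs.
            tEnv.executeSql(
                "CREATE TABLE meter_readings ("
                    + "  meter_id STRING,"
                    + "  kwh DOUBLE,"
                    + "  ts TIMESTAMP(3),"
                    + "  PRIMARY KEY (meter_id) NOT ENFORCED"
                    + ") WITH ("
                    + "  'connector' = 'hudi',"
                    + "  'path' = 'hdfs:///lake/meter_readings',"  // illustrative path
                    + "  'table.type' = 'MERGE_ON_READ'"
                    + ")");

            // A streaming job upserts into the table; an offline report can later query
            // the same table with plain SQL instead of reading a second warehouse copy.
            tEnv.executeSql(
                "INSERT INTO meter_readings "
                    + "VALUES ('m-01', 1.2, TIMESTAMP '2024-01-01 00:00:00')")
                .await();
        }
    }

The transactional, versioned commits written by the table format are what allow the metadata layer described above to offer warehouse-style management on top of low-cost lake storage.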
The ability to handle large-scale, high-concurrency access is another assessment criterion for the platform. In the early days of the Internet, in pursuit of simple implementation, most applications adopted a monolithic architecture in which all business functionality was developed, packaged, and deployed within a single project. However, the high coupling between functional modules impedes maintenance and further development, and leaves such applications unable to handle high-concurrency access or store massive amounts of data. In 2014, Martin Fowler [6] introduced the concept of microservices, which effectively addresses these issues by following the single-responsibility principle: an application is broken down into services that are developed and deployed independently.
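A minimal sketch of one such single-responsibility service in the Spring Cloud style follows, assuming Spring Boot with a Spring Cloud discovery-client dependency; the service name, route, and placeholder logic are illustrative assumptions.

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.RestController;

    // A hypothetical single-responsibility service: it only answers device-status
    // queries. Once registered with the discovery server, it can be located by
    // other services and scaled or redeployed without touching the rest of the platform.
    @SpringBootApplication
    @EnableDiscoveryClient
    @RestController
    public class DeviceStatusService {

        @GetMapping("/devices/{id}/status")
        public String status(@PathVariable("id") String id) {
            // Placeholder logic; a real service would query the data layer.
            return "device " + id + ": NORMAL";
        }

        public static void main(String[] args) {
            SpringApplication.run(DeviceStatusService.class, args);
        }
    }

Because each such service is packaged and deployed on its own, for example as a Docker container, a hotspot such as data queries can be scaled out independently instead of scaling the whole monolith.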
Based on these findings, we design and build a power big data platform utilizing big data frameworks such as Hadoop, Flink, and Hudi, combined with Web application frameworks such as Vue [7] and Spring Cloud [8], and adopt a technical architecture of integrated lake-warehouse storage, microservices, and integrated stream-batch computation. Finally, a prototype system is developed on top of this platform, which realizes the management of user permissions, data storage and updating, dynamic data visualization, and intelligent detection.